GENES & DEVELOPMENT 单细胞结果复现
风月难扯,离合不骚
1引言
2020 年 5 月份 在 冷泉港 GENES & DEVELOPMENT 杂志社发表了一篇 关于 YTHDF 1/2/3 识别蛋白功能性补偿的文章,其实是基于前面一篇 Samie R. Jaffrey 发表在 cell 期刊上的文章的后续验证的实验,在 小鼠配子发生 的模型上做了相关的实验。
cell 首篇文章:
该篇文章:
其中有几张单细胞的图,打算尝试复现看看:
图例:
(F) Dimension reduction representation of single-cell RNA-seq (UMAP) measured in adult mouse testis, showing mild expression of Ythdf1 and Ythdf3 in spermatogonia and in Sertoli cells, and more substantial expression of Ythdf2 in spermatogonia and in spermatocytes.
2下载数据
其实这篇文章的单细胞数据用的是别人文章的数据,并不是自己做的:
GSE112393
我们只需要下载 GSE112393_MergedAdultMouseST25_DGE.txt.gz
就行了,这是表达矩阵。
3文献分析方法
算是很常规的简单分析了:
4数据分析
读取表达矩阵
library(patchwork)
library(Seurat)
library(ggplot2)
library(data.table)
library(tidyverse)
library(ggsci)
# load rds
# sce <- readRDS('ythdf123_final.rds')
# load matrix
count <- as.matrix(fread('mergedMouseAdultST25_34633cells_37241genes.dge.txt',
header = T),
rownames = 1)
# check
head(count[1:3,1:5])
# ST1_ATCTTGCACATC ST1_CCGCCTTAGCTN ST1_CCCTGTAGGAGT ST1_CGATTCTACAAT ST1_TTTCCATAGATT
# 0610005C13Rik 0 0 0 0 0
# 0610007P14Rik 0 0 0 1 0
# 0610009B22Rik 6 3 7 4 4
可以看到行为基因,列为每个细胞。
构建 SeuratObject
# 构建 SeuratObject
sce <- CreateSeuratObject(counts = count, project = "mouse",
min.cells = 3, min.features = 500)
# 查看样本细胞量
table(sce$orig.ident)
# INT1 INT2 INT3 INT4 INT5 INT6 SER1 SER2 SER3 SER4 SER5 SER6 SER7 SER8 SPG1 SPG2 SPG3 ST1 ST2 ST3 ST4
# 819 237 1210 653 1480 1453 81 161 451 716 2015 1399 3047 1979 957 937 1212 2324 1642 2720 2363
# ST5 ST6 ST7 ST8
# 1601 2341 1549 1286
这里可以看到有多个样本,每个样本测了不同数量的细胞数量。
数据过滤
# 计算线粒体百分比
sce[["percent.mt"]] <- PercentageFeatureSet(sce, pattern = "^mt-")
# Show QC metrics for the first 5 cells
head(sce@meta.data, 3)
# orig.ident nCount_RNA nFeature_RNA percent.mt
# ST1_ATCTTGCACATC ST1 57507 7060 0.01738919
# ST1_CCGCCTTAGCTN ST1 53253 6746 0.04131223
# ST1_CCCTGTAGGAGT ST1 44129 6401 0.29912303
# Visualize QC metrics as a violin plot
VlnPlot(sce, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 1)
# 过滤
sce <- subset(sce, subset = nFeature_RNA > 500 & nFeature_RNA < 7500 & percent.
降维,归一化和标准化
# 归一化
sce <- NormalizeData(sce, normalization.method = "LogNormalize", scale.factor = 10000)
# 寻找高变基因
sce <- FindVariableFeatures(sce, selection.method = "vst", nfeatures = 5000)
# 标准化
sce <- ScaleData(sce)
# 降维
sce <- RunPCA(sce, features = VariableFeatures(object = sce))
查看有效主成分
# 查看有效PC主成分
sce <- JackStraw(sce, num.replicate = 100)
sce <- ScoreJackStraw(sce, dims = 1:20)
# 使用 JackStrawPlot 函数进行可视化
JackStrawPlot(sce, dims = 1:20)
ElbowPlot(sce)
文章选取了前 15 个主成分来做后续的分析:
# 主成分热图
DimHeatmap(sce, dims = 1:15, cells = 500, balanced = TRUE)
聚类
使用文章里的分辨率:
# 聚类
sce <- FindNeighbors(sce, dims = 1:15)
sce <- FindClusters(sce, resolution = 0.07)
# UMAP降维可视化
set.seed(20220603)
sce <- RunUMAP(sce, dims = 1:15)
查看批次效应:
# 查看样本批次效应
DimPlot(sce, reduction = "umap",group.by = 'orig.ident') +
scale_x_reverse() +
tidydr::theme_dr(xlength = 0.1,ylength = 0.1) +
theme(panel.grid = element_blank(),aspect.ratio = 1)
可以看到基本没什么批次效应,其实数据原文也说了这个问题:
聚类可视化:
# 聚类图
DimPlot(sce, reduction = "umap") +
scale_x_reverse() +
tidydr::theme_dr(xlength = 0.1,ylength = 0.1) +
theme(panel.grid = element_blank(),aspect.ratio = 1)
因为随机种子的问题,是没办法和原文一模一样的。
分群注释
这里我直接根据文章的图来直接添加名字了,正常做法是去找 marker 基因,然后手动去寻找注释:
# 根据 marker基因进行分群注释
new.cluster.ids <- c("RoundSpematid", "Spermatocyte", "Elongating", "Spermatogonia",
"Leydig", "Unknown", "Sertoli", "Endothelial/Myoid")
names(new.cluster.ids) <- levels(sce)
# 细胞分群的重命名
sce <- RenameIdents(sce, new.cluster.ids)
# 重新绘制聚类图
DimPlot(sce, reduction = "umap", label = TRUE, pt.size = 0.5) +
scale_x_reverse() +
tidydr::theme_dr(xlength = 0.1,ylength = 0.1) +
theme(panel.grid = element_blank(),aspect.ratio = 1) +
NoLegend()
寻找 MARker 基因
# find markers for every cluster compared to all remaining cells, report only the positive ones
sce.markers <- FindAllMarkers(sce, only.pos = TRUE, min.pct = 0.25,
logfc.threshold = 0.25)
## Calculating cluster RoundSpematid
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04m 24s
## ...
# save top six markers
topsixMarkers <- sce.markers %>% group_by(cluster) %>% top_n(n = 6, wt = avg_log2FC)
# check
head(topsixMarkers,3)
# # A tibble: 3 x 7
# # Groups: cluster [1]
# p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
# <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 0 2.14 0.784 0.221 0 RoundSpematid BC048502
# 2 0 2.10 0.784 0.143 0 RoundSpematid Acrv1
# 3 0 2.09 0.781 0.123 0 RoundSpematid Spaca1
绘制 ythdf1/2/3 基因表达
# 基因表达
FeaturePlot(sce, features = c("Ythdf1", "Ythdf2", "Ythdf3"),ncol = 1) +
theme(aspect.ratio = 1)
# 保存分析的结果
saveRDS(sce, file = "./ythdf123_final.rds")
5结尾
注意,直接从 GEO 上下的表达矩阵,第一列基因名没有列名.fread 读取会有问题,可以 vi 矩阵在第一列加个列名即可。
欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群
(微信交流群满200人后需收取20元入群费用), 数据代码已上传至QQ群,欢迎加入下载。
群二维码:
老俊俊微信:
知识星球:
所以今天你学习了吗?
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,赏杯快乐水喝喝吧!
往期回顾
◀Molecular Cell 文章 ribosome pausing 结果复现 (终)
◀Molecular Cell 文章 ribosome pausing 结果复现 (四)
◀Molecular Cell 文章 ribosome pausing 结果复现 (三)
◀Molecular Cell 文章 ribosome pausing 结果复现 (二) (PCR 去重)
◀Molecular Cell 文章 ribosome pausing 结果复现 (一)
◀...